Improvement: keep order-preserving repartitions for streaming aggregates#21107
Improvement: keep order-preserving repartitions for streaming aggregates#21107xudong963 wants to merge 3 commits intoapache:mainfrom
Conversation
I am not sure this is a "bug" necessarily -- more like a tradeoff. I believe the enforce_distribution plan will attempt to increase plan parallelism even if it has to resort by default My understanding is that this is what the prefer_existing_sort setting controls So if you want plans to keep existing sorts and not increase parallelism in that case, you should set |
alamb
left a comment
There was a problem hiding this comment.
Thanks @xudong963 -- this change looks good to me. As I understand it it avoids adding a HashRepartitioning (and uses a more memory efficient operator) so that sounds like a win all around
I was worried that this change would result in trading off "more sortedness" for "less parallelism" but from what I can see that is not the case.
One thing we may want to consider it gating this behavior behind the "prefer existing sort" flag -- that way
I don't think we should be ignoring the errors, but otherwise the code and tests look good to me.
| 10)------------------BoundedWindowAggExec: wdw=[row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Field { "row_number() ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING": UInt64 }, frame: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING], mode=[Sorted] | ||
| 11)--------------------LazyMemoryExec: partitions=1, batch_generators=[range: start=1, end=5, batch_size=8192] | ||
| 03)----RepartitionExec: partitioning=Hash([generated_id@0], 4), input_partitions=4, preserve_order=true, sort_exprs=generated_id@0 ASC NULLS LAST | ||
| 04)------AggregateExec: mode=Partial, gby=[generated_id@0 as generated_id], aggr=[array_agg(unnested.ar)], ordering_mode=Sorted |
There was a problem hiding this comment.
this plan looks better to me (it still is fully parallelized and now uses avoids an unecessary hash repartition
| // Parent is blocking even with ordering — no benefit | ||
| return false; | ||
| } | ||
| // Build parent with an unordered child (simulating CoalescePartitionsExec) |
There was a problem hiding this comment.
this commet is strange to me -- the code adds a CoalescePartitionsExec -- so I don't think it is "simulating" anything
| let with_ordered = | ||
| match Arc::clone(parent).with_new_children(vec![Arc::clone(ordered_child)]) { | ||
| Ok(p) => p, | ||
| Err(_) => return false, |
There was a problem hiding this comment.
I don't think we should ignore the Err here or below as it could mask real errors / a bug with this code
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
@alamb thanks for the review, this PR is not trading parallelism for sortedness, and it does not remove the hash repartition. The plan still uses the same Hash repartition with the same partition count; the only change is that we keep the order-preserving variant when removing it would cause the parent AggregateExec (in Sorted / PartiallySorted mode) to switch from incremental to blocking execution. |
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
This PR updates
EnforceDistributionto keep order-preserving repartition variants when preserving input ordering allows the parent operator to remain incremental/streaming.Previously, order-preserving variants could be removed when
prefer_existing_sort = falseor when there was no explicit ordering requirement, even if dropping the ordering would force a parent operator such asAggregateExecto fall back to blocking execution. This change adds a targetedpreserving_order_enables_streamingcheck and uses it to avoid replacingRepartitionExec(..., preserve_order=true)/SortPreservingMergeExecwhen that preserved ordering is what enables streaming behavior.As a result, the optimizer now prefers keeping order-preserving repartitioning in these cases, and the updated sqllogictests reflect the new physical plans: instead of inserting a
SortExecabove a plain repartition, plans now retainRepartitionExec(... preserve_order=true)so sorted or partially sorted aggregates can continue running incrementally.Are these changes tested?
Are there any user-facing changes?
No extra sort needed for these cases